An enhanced three-stage design with trend analysis for allergen immunotherapy trials

We previously introduced a three-stage design and associated end-of-stage analyses for allergen immunotherapy (AIT) trials. End-of-stage differences alone may not provide a fuller picture of Stages 2 and 3 effects because they may depend upon stage-specific durations. Therefore, we introduce an additional trend analysis to evaluate the difference in progression curves of two groups over the entire stage. Results from such analysis are used to inform persistence of end-of-stage benefit and thus provide evidence for stagewise effects beyond the study periods. We jointly apply end-of-stage and trend analyses to support the enhanced three-stage design to determine treatment response over time and sustained response to AIT. A simulation study was performed to illustrate the statistical properties (bias and power) of trend analyses under varying statistical missing mechanisms and effect sizes. The extent of bias depended on the missing mechanism and magnitude. Powers were largely driven by effect and sample sizes as well as pre-specified success margins, particularly of relative trend. As an illustration, assuming relative treatment differences of 25–30%, stagewise dropout rate of 15%, and parallel outcome progressions, a sample size of 200 per group may achieve 97% power to demonstrate a treatment effect and 53% power to demonstrate a sustained effect post-treatment. Trend analysis supplements the end-of-stage analysis to enhance the statistical claims of stagewise effects. Inferential statistics support our proposed trend analysis for evaluating benefits of AIT over time and inform clinical understanding and decisions.


Introduction
We previously introduced a three-stage design for allergen immunotherapy (AIT) trials to characterize the effects of an AIT product through stagewise statistical evaluations [1].Briefly, Stage 1 is a double-blind placebo-controlled randomized (DBPCR) study to evaluate an initial treatment effect in participants who received the active treatment compared to those who received placebo.If an initial treatment effect is demonstrated by the end of Stage 1, the trial will continue to Stage 2. In Stage 2, participants in the treatment group in Stage 1 continue to receive treatment (referred to as original treatment group hereafter), while the initial placebo recipients cross over to receive the treatment (referred to as crossover group hereafter) in an open-label fashion.Upon completion of Stage 2, the trial will continue to Stage 3. Stage 3 is a post-treatment follow-up study, in which all participants are followed for a pre-determined length of time to evaluate whether the treatment effect is maintained after treatment is discontinued.Such design preserves the initial randomization which allows further evaluations of the magnitude and duration of the treatment responses at later stages while provides timely access to active therapy for placebo recipients which potentially reduces dropouts due to perceiving little or no immediate benefit as well as differential dropouts for reasons related to treatment in later stages.
Currently, most AIT trials with long-term follow-up have focused on descriptive statistics.However, as described in Tang et al. (2021) [1], following conventional analysis strategy, comparative analysis of data collected at the end of each stage was proposed to evaluate the stagewise effects.These end-of-stage analyses can provide information on an initial treatment effect (Stage 1 effect), an effect associated with time of introduction and length of treatment (Stage 2 effect), and a potential sustained effect after treatment is withdrawn (Stage 3 effect).However, end-of-stage analyses are subject to stage-specific durations.In other words, end-of-stage differences may be impacted by the duration of the underlying stage.For example, in a hypothetical three-stage trial (Fig 1), an initial treatment effect is demonstrated by the end of Stage 1.The trial continues to Stage 2. At the end of Stage 2 (time point t2), the difference between two groups meets the pre-specified success margin of -15%, and thus a Stage 2 effect is claimed.However, if we extend the duration of Stage 2 from time point t2 to t2', following the same trend the difference may diminish at time point t2', and thus a Stage 2 effect will not be claimed.Similarly, after the treatment is withdrawn, the difference observed at the end of Stage 3 (time point t3) could also diminish when the two progression lines converge at time point t3', an extended duration time point from t3.Therefore, the lengths of the stages are important aspects in consideration when we describe and interpret the comparative results in the three-stage design.
Since the lengths of stages, particularly for Stages 2 and 3, are often constrained due to budget and operational consideration, analysis of the outcome progression would provide insights for expected stage effects beyond the underlying study periods.In the literature, comparing the progression of outcome between two groups has been suggested to infer persistence of a treatment effect for diseases that progress slowly such as Parkinson's disease and Alzheimer's disease [2][3][4].Therefore, given the circumstances when comparative results from end-of-stage analyses alone do not fully describe the stagewise effects, we further introduce trend analyses, in which the progression curves of the two groups over the entire Stage 2 (or 3) are compared and evaluated additionally.These complementary analyses would provide further information for understanding the profile and clinical benefit of an AIT and assist decision-making in clinical practice.
The remainder of this paper is organized as follows.The "Materials and Methods" section describes the notations and statistical hypotheses for Stages 2 and 3 trend analyses, the statistical methods for estimating the trend differences, the relationship between end-of-stage differences, trend differences and test margins, and the set-up of the simulation study to evaluate the performance of the proposed trend analyses under various levels of Stage 2 or 3 effect and missing mechanisms.The "Results" section summarizes the simulation results in terms of estimates, type I error rates, and powers for Stages 2 and 3 trend analyses.The "Discussion" section summarizes the contributions of the trend analyses in aiding interpretation of the Stages 2 and 3 results and describes the caveats of the proposed methods and future research directions.

Trend analysis
Because progression can be best described when study outcome is periodically assessed, we focus on AIT trials where repeatedly measured outcomes are available.We follow the same notations used in Tang et al. (2021) [1].For Stage i = 1, 2, 3, X ti and X pi are end-of-stage i responses for the original treatment and crossover groups respectively, and S i is end-of-stage difference measured by absolute difference (X ti −X pi )or relative difference ½100 � ðX ti À X pi Þ=X pi �.In addition, assuming a linear response curve, we denote β ti and β pi the slopes of the original treatment and crossover groups in Stages i = 2 and 3, respectively.The between-group trend differences during Stages 2 and 3 (denoted R 2 and R 3 ) are then measured by the slope ratios β t2 /β p2 , and β t3 /β p3 , respectively.
Table 1 shows the proposed stagewise statistical hypotheses for three-stage trials enhanced from Tang et al. (2021) [1].Hypotheses 1, 2a, and 3a are formulated for end-of-stage analyses to test for differences at the end of Stages 1, 2, and 3 (S 1 , S 2 andS 3 ), respectively.In addition, the response curves over Stages 2 and 3 are evaluated in trend analyses through the noninferiority comparisons described in Hypotheses 2b and 3b.During Stage 2, we expect to see improving trends (i.e., upward trend if higher outcome is better or downward trend if higher outcome is worse) in both groups as AIT is given, noninferiority of the slope suggests that the rate of improvement in the original treatment group is no less than that in the crossover group.Therefore, a lower bound margin is used in the hypothesis test.On the other hand, during Stage 3, we expect diminishing effect trends (i.e., downward trend if higher outcome is better or upward trend if higher outcome is worse) in both groups as AIT is withdrawn.Noninferiority of the slope suggests that the rate of diminishing effect in the original treatment group is no more than that in the crossover group.Therefore, an upper bound margin is used in the hypothesis test.
For example, if there exists a constant distance between the two progression curves, this will indicate that the end-of-stage benefit will persist, and particularly not diminishing, regardless of the duration of the stage.Therefore, with the incorporation of trend analysis in Stages 2 and 3, statistical claims of Stage 2 or 3 effect would reflect hints on the persistence of difference between two groups beyond the underlying study period.

Trend estimation
Estimation of the end-of-stage differences S 1 , S 2 and S 3 are described in Tang et al. ( 2021) [1].We focus on estimation of the Stage 2 and 3 trend differences R 2 and R 3 in this paper.We take estimation of the Stage 2 trend difference R 2 as an example.Let N and K denote the number of participants and time points available for Stage 2 trend analysis.If we observe the repeatedly measured outcome x jk for subject j at time point k during Stage 2 where j = 1,. ..,N, and k = 1,. ..,K, assuming a linear trend, a mixed model can be fitted to the repeatedly measured outcome during Stage 2 as follows: where Group is a dummy variable indicating the group assignment for subject j (Group = 1 for the original treatment group and 0 for the crossover group), Time is a continuous variable representing the time point at which the outcome x jk is measured, u j is a random effect within subject j, and ε jk is the error term, for j = 1,. ..,N, and k = 1,. ..,K.
After fitting model (1), we obtain the coefficient estimates b0 ; b1 ; b2 , and b3 , and the estimated variance-covariance matrix ; where ŝ2 p is the estimated variance for bp and ŝpq is the estimated covariance between bp and bq ðp; q ¼ 0; 1; 2; 3Þ.The slopes of the original treatment and crossover groups during Stage 2 are estimated as bt2 ¼ b2 þ b3 , and bp2 ¼ b2 , respectively.The trend difference R 2 is then estimated as the slope ratio bt2 . The standard error of the estimated trend difference can be computed using the delta method [5] as ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi The 100(1-α)% confidence interval (CI) around the estimated Stage 2 trend difference is con- where z α/2 is the upper α/ 2-quantile for the standard normal distribution.The null hypothesis 2b is rejected if the lower limit (LL) of the 100(1-α)% CI around the estimated Stage 2 trend difference is greater than λ 2 .Estimation of the Stage 3 trend difference R 3 and its standard error follows the same statistical approaches as described above.Similarly, the null hypothesis 3b is rejected if the upper limit (UL) of the 100(1-α)% CI around the estimated Stage 3 trend difference is less than λ 3 .

Relationship between end-of-stage differences, trend differences and test margins
In this section, we attempt to use Stage 2 analysis as an example to show intertwined relationship between end-of-stage differences, trend differences and associated test margins following the notations introduced in the "Trend analysis" subsection.
Under linear trends, the Stage 2 trend difference is measured by slope ratio (R 2 ).When end-of-stage 1 and 2 differences (S 1 and S 2 ) are measured by absolute difference, the relationship between S 1 , S 2 , and R 2 is shown below: where t s2 indicates the duration of Stage 2. Alternatively, when end-of-stage 1 and 2 differences are measured by relative difference, the relationship is as follows: Based on Eqs (2) and (3), the Stage 2 trend difference is related to several components: endof-stage 1 difference (S 1 ), comparator Stage 2 slope (i.e., slope of the crossover group during Stage 2), duration of Stage 2, and most importantly end-of-stage 2 difference (S 2 ).As a result, when end-of-stage 1 data are available and if linear progression is a plausible assumption, there exists a linear relationship between the two quantitative metrics for measuring a Stage 2 effect (S 2 and R 2 ), given the comparator slope and duration of Stage 2.
The relationships between two effects (S 2 and R 2 ) shown above can be further rendered in the relationship between two test margins (δ 2 and λ 2 ) which are the fixed thresholds representing clinically meaningful differences.When higher outcome is worse, as shown in Table 1, the two success criteria 2a and 2b for claiming a Stage 2 effect are S 2 <δ 2 and R 2 >λ 2 , respectively.We expect to see downward trends for an improving outcome as AIT is given for both groups during Stage 2, which accord with negative slopes during Stage 2 (i.e., β p2 <0).Incorporating S 2 <δ 2 into Eq (2) or (3), success criterion 2a becomes � is greater (or less) than λ 2 , the success criterion 2a (or 2b) would drive the sample size/power, respectively.When higher outcome is better, the relationship is similar but in a reversed way.
The above discussions can be similarly applied to the respective parameters in Stage 3. The purpose of this section is to provide theoretical background for explaining some of the findings described in later sections of the paper.As another overarching purpose, they may also be used as a reference in determining clinically meaningful margin for the treatment difference at the end of the trial given the understanding of the progression profiles of the two groups, or vice versa.

Trial simulations
A simulation study was carried out to illustrate testing procedures at different stages and to evaluate the performance of the proposed trend analyses under various levels of Stage 2 or 3 effect and missing mechanisms [missing at random (MAR) or missing not at random (MNAR)].We employed the same three-stage trial to evaluate the benefits of an AIT in reducing the total combined rhinitis score (TCRS) at 1-, 3-, and 5-years as described in Tang et al. (2021) [1].Fig 2(A) shows study durations and frequencies of measured outcomes of the simulated trial.Specifically, study durations for Stages 1 through 3 were 12, 24, and 24 months, respectively.Average TCRS measurements were repeatedly obtained every three months starting from Stage 2.
End-of-stage and trend differences were measured by relative difference and slope ratio, respectively.The success margin for end-of-stage differences was chosen to be -15% (i.e., δ 1 = δ 2 = δ 3 = -15%).The non-inferiority margin was chosen to be 80% and 125% for Stages 2 and 3 trend differences (i.e., λ 2 = 80% and λ 3 = 125%), respectively.Under such settings, the success criteria 2a and 2b (corresponding to hypotheses 2a and 2b) for declaring a Stage 2 effect were that the UL of the two-sided 95% CI around the relative difference at the end of Stage 2 was less than -15%, and the LL of the two-sided 95% CI around the slope ratio for Stage 2 was greater than 80%, respectively.Similarly, the success criteria 3a and 3b for declaring a Stage 3 effect were that the UL of the two-sided 95% CI around relative difference at the end of Stage 3 was less than -15%, and the UL of the two-sided 95% CI around the slope ratio for Stage 3 was less than 125%, respectively.
As shown in Fig 2(A).the average TCRS was 8.0 at baseline and reduced to 6.0 and 7.5 at the end of Stage 1 for the original treatment and crossover groups, respectively.Thus, in this simulation, an administration of AIT for the first 12 months was associated with an improvement of 20% relative reduction in TCRS [i.e., (6.0-7.5)/7.5 = -20%].scenarios of different levels of Stage 2 or 3 effect.For Stage 2 scenarios, the slope of the original treatment group was fixed at -0.125.We then assigned a slope ratio of 0.80 (= noninferiority margin for Stage 2), 0.85, 0.90, and 1.00 for Stage 2 scenario NULL, A, B and C, respectively, demonstrating an increasingly stronger Stage 2 effect.Similarly, the slope of the crossover group was fixed at 0.055 for Stage 3 scenarios.We then assigned a slope ratio of 1.25 (= noninferiority margin for Stage 3), 1.1, 1.0 and 0.9 for Stage 3 scenarios NULL, D, E, and F, respectively, demonstrating an increasingly stronger Stage 3 effect.For illustration purposes, Stage 3 scenarios were investigated following Stage 2 scenario C, and thus the initial score at the beginning of Stage 3 was 3.00 and 4.50 for the original treatment and crossover groups, respectively.As explained in the previous subsection, under the current settings, Stage 2/3 success would be dominated by success criterion 2b/3b under all scenarios.
Under each of the underlying scenarios, 10,000 trials were simulated for a sample size of 200 per group.Data were generated following the same methods as described in Tang et al. ( 2021) [1].Model (1) was fitted to estimate the relative difference and slope ratio for each stage.First-order autoregressive correlation structure was introduced when fitting the model to account for longitudinal measurements from the same individual.Trend differences were estimated as described in the "Trend estimation" subsection.Analyses were conducted on both the observed data and ten sets of multiply imputed data from multivariate imputations by chained equations.Estimates from analysis of each set of multiply imputed data were combined following Rubin's rules.

Results
With a sample size of 200 per group, there was approximately 97% power to demonstrate Stage 1 effect (relative symptom score reduction is more than 15%) if the true improvement is 20% relative reduction as shown in Fig 2(A).
Estimates and absolute biases of the relative difference and slope ratio under various Stage 2 or 3 effect are shown in Table 2.Both Stages 2 and 3 estimates were generally unbiased under MAR.Under MAR, analyses of imputed data resulted in slightly larger absolute biases compared to those from analyses of observed data.The absolute biases obtained under MNAR were noticeably larger compared to those obtained under MAR.Under MNAR, Stage 3 estimates tended to be more biased as more participants dropped out, with larger absolute biases compared to those of Stage 2 estimates.
Since estimates tended to be biased under MNAR, we chose to present the powers for Stage 2 or 3 success under MAR only (Fig 3).Fig 3(A) shows the powers for meeting one success criterion (2a or 2b) or both success criteria under various Stage 2 scenarios.As noted above, type I error rates under the null scenario were largely inflated for meeting success criterion 2a while preserved at the nominal level for meeting success criterion 2b or both success criteria.The powers for meeting success criteria 2a were above 92%.The powers for meeting success criterion 2b/both criteria were 20%, 63%, and 96% from analysis of observed data under Stage 2 scenarios A, B, and C, respectively.Analysis of imputed data resulted in slightly less powers (i.e., 15%, 54%, and 94%, respectively), possibly due to slightly larger biases and standard error estimates after multiple imputation.Fig 3(B) shows the powers for meeting one success criterion (3a or 3b) or both success criteria under various Stage 3 scenarios.We observed a similar pattern in type I error rates.The powers for meeting success criterion 3a were above 96%, and the powers for meeting success criterion 3b/both criteria were approximately 26%, 53%, and 78% under Stage 3 scenarios D, E, and F, respectively.

Discussion
In our previous report [1], we introduced the framework of a three-stage design for stagewise evaluations of the efficacy of an AIT product.The proposed framework provides end-of-stage inferential statistics to inform clinical understanding of an investigative AIT over time.Whether the comparative results from end-of-stage analyses alone are adequate to support a Stage 2 or 3 effect is debatable, because end-of-stage difference between two groups may depend on the duration of Stage 2 or 3. Ideally, Stages 2 and 3 should be of sufficient duration such that the treatment effect is manifested during Stage 2, and a sustained effect can be captured at the end of Stage 3, respectively.However, if Stage 2 or 3 are curtailed for any reason, effect may be falsely claimed based on end-of-stage difference alone.
Here we introduce a trend analysis to evaluate the difference in progression curves of two groups over the entire stage.For example, if there exists a constant distance between the two progression curves (i.e., equivalent slopes), this will indicate that the end-of-stage benefit will persist, and particularly not diminishing, regardless of the duration of the stage.Since equivalence of slopes between the two groups over the period of Stage 2 or 3 indicates that the endof-stage benefit does not diminish over time, trend analysis may provide more vigorous evidence for a Stage 2 or 3 effect than end point alone.Together, end-of-stage and trend analyses of Stage 2 data can solidify information on how time of introduction and length of treatment will impact the response to treatment over time to inform an optimal duration.Subsequently, joint results from end-of-stage and trend analyses of Stage 3 data will inform of the duration of benefit after treatment is withdrawn-i.e., whether there is a sustained effect.Since multiple hypothesis testing are proposed across stages, discussions may be necessary with respect to multiplicity issues.We designed the continuation of the study in later stages to be dependent on a Stage 1 success, and thus, type I error control is in place via gate-keeping strategy from Stage 1 to Stages 2/3.Furthermore, since the two hypotheses in Stage 2 (2a and 2b) or Stage 3 (3a and 3b) are intersection-union testes, type I error rate is controlled at a conventional significance level (i.e., 0.05) for Stages 2 and 3 analyses, respectively (shown in our simulation results).We consider a controlled type I error rate for each of Stages 2 and 3 analyses adequate and acceptable because analyses of Stages 2 and 3 data are indented to provide information on the timing and duration of benefit on and off immunotherapy, respectively.In case where clinical decision/firm conclusions are to be made at the end of Stages 2 and 3, because of the chronological nature of the design resulting in hypothesis testing being conducted sequentially for the two stages, we consider it meaningful and acceptable to address the unique and equally important objective of each stage (one after another) at a conventional significant level as if they were separate studies.
There remains caveat to the study outcome and statistical approaches that we chose.A continuous outcome was repeatedly measured in our simulation study.For simplicity, a linear trend was assumed for Stages 2 and 3 such that the pattern of the effect over the entire stage could be evaluated by a slope ratio.However, the assumption of a linear trend may not be appropriate.For example, the symptoms of seasonal allergy are cyclical and depend upon pollen levels, which may be unpredictable and vary from year to year.When a linear assumption is implausible, other statistical measures such as area under the curve (AUC) may be appropriate to capture the disease burden during the entire stage.In addition, a non-parametric model may be an appropriate alternative to evaluate a shift manifesting treatment effect.Therefore, the type of model for trend analysis and the statistical measure for capturing the trend should be specific to the allergic disease under investigation.For example, the outcomes and models for food and environmental allergy may not be the same.
Furthermore, periodic assessments may be available so that the time course of the effect over the entire stage can be analyzed in the trend analysis.However, when the study outcome is not repeatedly measured, it may not be clear whether the comparative results from end-ofstage analyses alone sufficiently support the conclusion of a sustained effect.
Finally, these proposed analysis strategies are designed towards predicting trends of the benefit of different durations of treatment (i.e., treatment versus control in Stage 1) as well as persistence of that benefit after withdrawal of treatment.While these may point towards a binary decision as to whether the proposed treatment is beneficial, trends may also guide investigators toward study design choice such as length of each stage and clinically meaningful difference margins.These trends may also be useful for clinicians towards choosing optimal lengths of treatment and observation periods after withdrawal.

Fig 2 .
Fig 2. (A) Study durations and frequencies of measured outcomes of the simulated trial; (B) Eight scenarios of different levels of Stage 2 or 3 effect.X t2 = Average TCRS of the original treatment group at the end of Stages 2. X p2 = Average TCRS of the crossover group at the end of Stages 2. X t3 = Average TCRS of the original treatment group at the end of Stages 3. X p3 = Average TCRS of the crossover group at the end of Stages 3. β t2 = Slope of the original treatment group during Stage 2. β p2 = Slope of the crossover group during Stage 2. β t3 = Slope of the original treatment group during Stage 3. β p3 = Slope of the crossover group during Stage 3. Note: All Stage 3 scenarios were investigated following Stage 2 scenario C. https://doi.org/10.1371/journal.pone.0291533.g002

Table 2 . Estimates and absolute biases of the relative differences and slope ratio under Stages 2 and 3 scenarios.
MAR = missing at random; MNAR = missing not at random Note: Refer to Fig 2(B) for the various effects under Stages 2 and 3 scenarios.https://doi.org/10.1371/journal.pone.0291533.t002